Good Data Habits

Starting as you mean to go on

24-Nov-2025

File naming conventions

Organizing files

Which of these files contains the most recent version of the data?

$ ls -l data/
-rw-r--r--. 1 hannah hannah 0 Sep 23 11:38 sample_metadata_clean.tsv
-rw-r--r--. 1 hannah hannah 0 Sep 23 11:37 sample_metadata.tsv
-rw-r--r--. 1 hannah hannah 0 Sep 23 11:38 sample_metadata_USE_THIS_ONE.tsv
-rw-r--r--. 1 hannah hannah 0 Sep 23 11:38 sample_metadataV2_final.tsv
-rw-r--r--. 1 hannah hannah 0 Sep 23 11:38 sample_metadataV2.tsv

File naming conventions

What makes a file name useful? Metadata

  • Project name or acronym
  • Study title
  • Location
  • Data type
  • Researcher initials
  • Date
  • Data stage (raw, filtered, etc.)
  • Version number
  • File type
  • For scripts: a name that describes what the script does
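A minimal sketch of how the fields above could be joined into one consistent file name. The function name `build_file_name`, the field order, the `_` separator, and the example values (including the initials `XX`) are assumptions for illustration, not a lab standard.

```python
import re
from datetime import date


def build_file_name(project, data_type, stage, version, extension,
                    when=None, initials=None):
    """Join metadata fields into a single file name (hypothetical convention)."""
    when = when or date.today()
    parts = [project, data_type, stage]
    if initials:
        parts.append(initials)
    parts.append(when.strftime("%Y%m%d"))  # YYYYMMDD sorts chronologically
    parts.append(f"v{version:02d}")        # zero-padded so v02 sorts before v10
    # Inside each field, replace spaces and other awkward characters with hyphens
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "-", p).strip("-") for p in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


print(build_file_name("arcss", "sample metadata", "clean", 2, "tsv",
                      when=date(2025, 9, 23), initials="XX"))
# arcss_sample-metadata_clean_XX_20250923_v02.tsv
```

Underscores separate fields and hyphens separate words within a field, so the name can be split back into its parts later.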

Task of the day:

  • rename 3 files

Directory structures - Get organized!

Last time on 10-minute data science…

We discussed file naming conventions and you were supposed to rename 3 files. How did that go?

Directory Structures

Where should you look to find the latest version of the protocol you’re interested in testing?

Our lab’s SharePoint is a good example of what not to do…

Directory Structures

Which enzyme assay is the one you want?

Best practices

  • Choose an organizational style; stick with it

  • If sharing (with colleagues or future you!), document the organizational style

  • Divide work into project directories.

    • Thesis
      • Chapters
    • Papers
      • Sections
    • Grants
      • subprojects or papers

Take home: Project directories should be self-contained and hold all files needed to go from raw data to final results

Organizing around project directories

There’s no one best way to organize a project but…

  • Data is Read-Only! (Your most important goal for setting up a project directory)

  • Store data-cleaning scripts in a separate folder, and write their output to a second read-only ‘clean data’ folder

  • Generated output is disposable (you should be able to delete anything your scripts generate without concern)

  • Save your useful code by wrapping it in share-able functions
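One way to enforce "Data is Read-Only" is to strip write permission from the raw data files once they are in place. The sketch below assumes a Unix-like file system; the function name `make_read_only` and the `data` folder name in the usage comment are illustrative.

```python
import stat
from pathlib import Path


def make_read_only(data_dir):
    """Remove write permission from every file under data_dir.

    Returns the list of files whose permissions were updated.
    """
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    changed = []
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode & ~write_bits)  # keep read/execute bits, drop write
            changed.append(path)
    return changed


# Example (assumes a folder called "data" in the project directory):
# make_read_only("data")
```

This protects against accidental edits only; anyone who owns the files can restore write permission, so backups still matter.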

Subdirectory choice

What subdirectories do folks use?

What questions should you ask when creating a new subdirectory?

Example 1: ARCSS Grant

  • A project I joined when I started working here
  • Organized around anything relevant to the grant
  • Includes sub-“project” directories as well as writing, administrative information, and literature
  • sub-projects are tracked with version control software, but not this directory
 Conferences/     Conference presentations, travel administrative documents
 Sean_qsip_tree/  Project file for creating a phylogenetic tree with Sean's qSIP project
 Literature/      Relevant literature for ARCSS project (automatically integrated into Zotero/Mendeley libraries)
 Senescence/      Project to identify likely senescence times for our sites
 mimics_webapp/   Project for Stuart's harebrained (but genius) idea to turn MIMICS into a webapp
 Picarro Code/    Nascent code for processing Picarro outputs
 useful_images/   Helpful images related to the project. Often useful in creating figures or presentations
 Protocols/       Protocols related to lab work
 Writing/         Writing folder; includes derived grants, manuscripts, etc.
 qsip/            FICUS qsip project

Example 2: The temporal paper

  • Self-contained project
  • highly collaborative; structure is co-created with others
  • designed to be tracked with version control software from day 1
 Assembly-analysis/                     Sub-analyses; files contain code, outputs, figures
 cazyme_scraper/                        Shortcut to a different project file, where I wrote a code pipeline
 CN_versatility/                        Sub-analyses; files contain code, outputs, figures
 Core_microbiome/                       Sub-analyses; files contain code, outputs, figures
 data/                                  Raw data; files never edited; common across collaborators; contains both shortcuts to large data sets and actual files
 general_climate_weather/               Sub-analyses; files contain code, outputs, figures
 GraftM-analysis/                       Collaborators' sub-analyses; I don't have to edit anything in here
 identifying-outlier-years/             Sub-analyses; files contain code, outputs, figures
 identify-temp-WTD-responders/          Sub-analyses; files contain code, outputs, figures
 Metabolic-analysis/                    Collaborators' sub-analyses; I don't have to edit anything in here
 metadata_availability/                 Sub-analyses; files contain code, outputs, figures
 quantify_stability_with_time_figure/   Sub-analyses; files contain code, outputs, figures
 SingleM-analysis/                      Sub-analyses; files contain code, outputs, figures
 setup.R                                Common analysis script that takes raw data and does initial cleaning
 README.md                              Readme file; describes how to set up the code and data on your own computer
 temporal_paper.yml                     Contains instructions for installing the software necessary for running all the code in the project
 install_dependencies.sh                Secondary installation script for software not covered by temporal_paper.yml

Example 3: The dada2 pipeline

  • Purpose: Tutorial/pipeline
    • Doesn’t have unique raw data
  • Output folders generated by code
  • Emphasis on portability to other computers
 R/                    R scripts live here; they include documentation in the form of R Markdown
 slurm/                Slurm scripts for submitting to the supercomputer live here
 dada2_ernakovich.yml  Installation and software information
 README.md             Tutorial information

Not sure which is best? Templates exist!

Heidi Seibold’s Research Project Template

.
├── README.md
├── analysis            <- all things data analysis
│   └── src             <- functions and other source files
├── comm
│   ├── internal_comm   <- internal communication such as meeting notes
│   └── journal_comm    <- communication with the journal, e.g. peer review
├── data
│   ├── data_clean      <- clean version of the data
│   └── data_raw        <- raw data (don't touch)
├── dissemination
│   ├── manuscripts
│   ├── posters
│   └── presentations
├── documentation       <- documentation, e.g. data management plan
└── misc                <- miscellaneous files that don't fit elsewhere
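A template like this can be created in one step rather than by hand. The sketch below rebuilds the folder layout shown above; the function name `create_project` and the placeholder README text are assumptions, not part of the template itself.

```python
from pathlib import Path

# Leaf folders from the template tree above; parent folders are created automatically
TEMPLATE_DIRS = [
    "analysis/src",
    "comm/internal_comm",
    "comm/journal_comm",
    "data/data_clean",
    "data/data_raw",
    "dissemination/manuscripts",
    "dissemination/posters",
    "dissemination/presentations",
    "documentation",
    "misc",
]


def create_project(root):
    """Create the template folders under root and add a starter README.md."""
    root = Path(root)
    for sub in TEMPLATE_DIRS:
        (root / sub).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(
            f"# {root.name}\n\nDescribe the project and its folder layout here.\n"
        )
    return root


# Example (hypothetical project name):
# create_project("my_new_project")
```

Because `exist_ok=True` is set, running it again on an existing project only fills in missing folders.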

Taking project folders to the next level

Project folders allow you to take advantage of coding and project management tools

  • Most IDEs (Integrated Development Environments, e.g. RStudio) are set up to allow users to work in and switch easily between projects

  • Git version tracking - to track your code and files, set up version tracking in the project folder

  • Sharing a project is easy - simply share the project folder with the collaborator

Ernakovich Lab Discussion

  • Determine organization norms
  • Reorganize Ernakovich Sharepoint
  • Create a Guide (‘readme’) for directories

Task of the day

Either: Create an R project for code you’re working on now

Or: Think about your preferred management style; Reorganize a project directory around that style. (Don’t forget to document it!)

Metadata

Last time on 10-minute data science…

We discussed directory structures. How did your tasks go?

What is metadata?

  • Descriptive metadata: information about the content and context of your data.

    Examples: title, creator, subject keywords, and description (abstract)

  • Structural metadata: information about the physical structure of compound data.

    Examples: camera used, aperture, exposure, file format, and relation to other data or files

  • Administrative metadata: information used to manage your data.

    Examples: when and how they were created, who can access them, directory structure, software required to use them, and copyright permissions

What should metadata include?

  • Units
  • Resolution
  • Meaning of column names
  • Description of caveats, issues, or missing values
  • How data was collected
  • Filtering or processing steps the data has been through (if applicable)
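The items above can be turned into a fill-in-the-blanks template so that no field gets forgotten. This is a minimal sketch: the function name `render_data_readme`, the layout of the text, and every example value (column names, units, caveats) are hypothetical.

```python
def render_data_readme(file_name, collected, columns, caveats=(), processing=()):
    """Render a plain-text metadata description for one data file.

    columns maps each column name to a dict with the keys
    "meaning", "units", and "resolution".
    """
    lines = [f"File: {file_name}", f"How collected: {collected}", "", "Columns:"]
    for name, info in columns.items():
        lines.append(
            f"  {name}: {info['meaning']} "
            f"(units: {info['units']}; resolution: {info['resolution']})"
        )
    lines += ["", "Caveats, issues, missing values:"]
    lines += [f"  - {item}" for item in caveats] or ["  - none recorded"]
    lines += ["", "Filtering or processing steps:"]
    lines += [f"  {i}. {step}" for i, step in enumerate(processing, start=1)] \
        or ["  - none (raw data)"]
    return "\n".join(lines) + "\n"


# Hypothetical example values, for illustration only
print(render_data_readme(
    "sample_metadata.tsv",
    "Recorded in the field on paper, then typed in",
    {"soil_temp_c": {"meaning": "Soil temperature at sampling depth",
                     "units": "degrees C", "resolution": "0.1"}},
    caveats=["Missing values are coded as NA"],
))
```

Empty sections are written out explicitly ("none recorded") so a reader can tell the difference between "nothing to report" and "never filled in".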

What do you do if you don’t know what kind of metadata to include?

  • look it up (many data types have standards), e.g. for microbial genomics data:
    • MIMARKS (Minimum information about a marker gene sequence)
    • MIMAG (Minimum information about a metagenome-assembled genome)
  • phone a friend

Where should metadata be stored?

  • Depends on the project
  • Excel files - in the first tab
  • CSV/text files - in a “readme.txt” file that describes all data in the folder
  • Project directories - “README.md” files
  • Other thoughts?
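If the convention is one readme per folder, a quick audit can show which folders still need one. A rough sketch, assuming readme files are named "README.md" or "readme.txt" (case-insensitive) and that only the immediate subfolders of a project are checked; `subdirs_missing_readme` is a made-up name.

```python
from pathlib import Path

README_NAMES = {"readme.md", "readme.txt"}


def subdirs_missing_readme(root):
    """Return names of the immediate subfolders of root that have no readme file."""
    missing = []
    for folder in sorted(Path(root).iterdir()):
        # Skip plain files and hidden folders such as .git
        if not folder.is_dir() or folder.name.startswith("."):
            continue
        names = {child.name.lower() for child in folder.iterdir()}
        if not names & README_NAMES:
            missing.append(folder.name)
    return missing


# Example: list undocumented folders in the current project directory
# print(subdirs_missing_readme("."))
```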

Task of the day

Either: Create a metadata file for a raw dataset that does not yet have one

Or: Create a metadata file for a project directory that describes its structure and usage

Weekly and Daily Checklists

Last time on 10-minute data science…

We discussed metadata and README files. What was challenging about creating readme files?

What are some habits you have at work?

  • checking email
  • wearing gloves when handling chemicals
  • maintaining a lab notebook

Establishing Good Data Habits

Good data habits can be implemented regardless of your experience or computational skill level

Today we’ll go through some checklists you can use to help cultivate good data and coding habits

When starting a project

When you receive (or collect) data

When beginning to analyze data

At the end of the day

Task of the day

  1. Create and implement a daily checklist for working with data

Congratulations!!

You’ve made it through “Good Data Habits!”

Choose Your Next Adventure: